ABSTRACT

In order to efficiently study the characteristics of network domains 
and support development of network systems (e.g. algorithms, pro- 
tocols that operate on networks), it is often necessary to sample a 
representative subgraph from a large complex network. Although 
recent subgraph sampling methods have been shown to work well, 
they focus on sampling from memory-resident graphs and assume 
that the sampling algorithm can access the entire graph in order 
to decide which nodes/edges to select. Many large-scale network 
datasets, however, are too large and/or dynamic to be processed us- 
ing main memory (e.g., email, tweets, wall posts). In this work, 
we formulate the problem of sampling from large graph streams. 
We propose a streaming graph sampling algorithm that dynami- 
cally maintains a representative sample in a reservoir based setting. 
We evaluate the efficacy of our proposed methods empirically using
several real-world data sets. Across all datasets, we found that
our method produces samples that better preserve the original graph
distributions.

1. INTRODUCTION 

Many real-world complex systems can be represented as graphs 
and networks — from information networks, to communication net- 
works, to biological networks. Naturally, there has been a lot of in- 
terest in studying characteristics of these networks, modeling their 
structure, as well as developing algorithms and systems that operate 
on the networks. While the recent surge in activity in online social 
networks (e.g., Facebook, Twitter) has prompted a similar need for 
characterization and modeling efforts, it is often much harder than 
in traditional networks due to their size. Specifically, these networks
tend to be too large to efficiently acquire, store and/or analyze
(e.g., one billion chat messages per day in Facebook [35]). It
is therefore often necessary to sample smaller subgraphs from the 
larger network structure, that can then be used to investigate the 
characteristics and properties of the larger network. It can also be 
used to drive realistic simulations and experimentation before de- 
ploying new protocols and systems in the field — for example, new 
Internet protocols, social/viral marketing schemes, and/or fraud de- 
tection algorithms. 
In this work, we consider the following graph sampling problem:
Assume an input graph $G = (V, E)$ of size $N = |V|$, from
which a sampling algorithm selects a subgraph $G_s = (V_s, E_s)$ with
a subset of the nodes ($V_s \subset V$) and/or edges ($E_s \subset E$),
such that $|V_s| = \phi N$. We refer to $\phi$ as the sampling fraction.
The goal is to sample a representative subgraph $G_s$ that matches many
of the properties of $G$, so that $G_s$ can be used to simultaneously
preserve several characteristics of the network structure in the
original graph $G$ (e.g., degree, path length, clustering). Specifically,
we aim to select a $G_s$ that minimizes the distributional distance over
several graph measures (e.g., degree distribution) simultaneously.
Let $f(\cdot)$ be a property of a graph; then the goal is to select a
sample that minimizes the distance between the property in $G$ and the
property in $G_s$: $dist[f(G), f(G_s)]$. In this work, we consider
degree, hop plot, and clustering distributions for $f(\cdot)$ and
evaluate using two distributional distance metrics: Kolmogorov-Smirnov
distance and skew divergence [23].

While many graph sampling methods have been proposed before
(e.g., [17], [26]), they typically require access to the entire graph
at every step, in order to decide which nodes/edges to
select. While the graph data can be stored on disks, processing 
full large graphs is usually done using physical memory (RAM) 
which is a limited/expensive resource. Therefore, the ideal ap- 
proach to process the graph data is to use a streaming model, where 
the graph data is presented as a stream of edges, and any computa- 
tion on the stream relies on using a small amount of memory, and 
in a single pass. Many large-scale network datasets readily admit 
such a streaming model. For example, online social network ap- 
plications (e.g. Facebook, Twitter) have data that consist of micro- 
communications among users (e.g. wall posts, tweets, emails); any 
activity between two users can result in an edge getting added to 
the activity graph. 

In this work, we consider the problem of sampling from such 
large social activity streams. We refer to social activity streams as 
graph streams since social activities can be represented as a graph. 
Specifically, our goal is to devise a streaming algorithm for sam- 
pling subgraphs from large graph streams, that can decide whether 
to include an edge in the sampled graph, as the edge is streamed in. 

While there is a great deal of research on data streams and data 
stream management, to our knowledge, our work is the first work to 
focus on streaming algorithms for sampling subgraphs from large 
graph streams in a single pass. Satisfying the dual objective of find- 
ing a sampling algorithm that can sample representative subgraphs, 
while being amenable to a streaming implementation is quite
challenging. Most existing sampling algorithms cannot process graph
streams because they require multiple passes over the edges. As an
example, breadth-first search needs to access the full neighborhood of
a node to perform one step of its process.



In this paper, we propose a novel sampling algorithm that is 
amenable to streaming implementations. Specifically, we propose 
partially-induced edge sampling (PIES) that randomly samples edges, 
induces the sampled nodes, and maintains a dynamic/changing sam- 
ple in a reservoir-based setting using a single pass over the edges. 

Our proposed approach is simple, efficient, and can be used to
sample graphs that are too large to fit in memory. Moreover,
it can also be applied to graphs that readily admit the streaming
model (e.g. email logs, tweets between users in Twitter).

We evaluate PIES over a number of real-world datasets (e.g., Facebook,
Twitter, HepPH, Flickr) collected by other researchers ([1], [39]),
and an email network constructed from two weeks of Purdue
University email traffic. We compare PIES to existing/proposed
baseline stream sampling techniques such as edge sampling, node 
sampling and a simple breadth-first search (BFS) based algorithm. 

Across all datasets, we observed that PIES produces samples that 
better match the distributions of degree, path length and clustering 
compared to other existing algorithms. 

The rest of the paper is organized as follows. We first present
background and related work on sampling methods in Section 2.
Next, we outline our proposed sampling algorithms with streaming
implementations in Section 3. Finally, we compare PIES with other
baseline graph sampling algorithms in Section 4.

2. GRAPH SAMPLING ALGORITHMS 

In this section, we discuss standard graph sampling algorithms
in the literature, which can be broadly classified as node-based,
edge-based, and topology-based methods. Most graph sampling
algorithms have two basic components: (1) node selection, and
(2) subgraph formation. The node selection step identifies a sample
set of nodes ($V_s$), while the subgraph formation step selects the set
of edges ($E_s$) to be included in the sampled subgraph. We distinguish
between two different approaches to subgraph formation, total and
partial graph induction, which differ by whether all or
some of the edges incident on the sampled nodes are included in
the sampled graph. The resulting sampled graphs are referred to as
the induced subgraph and partially induced subgraph respectively.

Node sampling (NS). In classic node sampling, nodes are chosen
independently and uniformly at random from the original graph for
inclusion in the sampled graph. For a target fraction $\phi$ of nodes
required, each node is simply sampled with a probability of $\phi$.
Once the nodes are selected, the sampled graph consists of the induced
subgraph over the selected nodes, i.e., all edges among the sampled
nodes are added to form the sampled graph. While node sampling
is intuitive and relatively straightforward, the work in [37] shows
that it does not accurately capture properties for graphs with
power-law degree distributions. Similarly, [24] shows that although
node sampling appears to capture nodes of different degrees well, due
to its inclusion of all edges for a chosen node set, the original level
of connectivity is not likely to be preserved.
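As an illustration, classic node sampling with total induction can be sketched in a few lines of Python. This is a minimal sketch that assumes a memory-resident graph given as a list of undirected edge pairs; the function name `node_sample` and the `seed` argument are hypothetical and not part of the original work.

```python
import random


def node_sample(edges, phi, seed=0):
    """Classic node sampling followed by total graph induction.

    `edges` is a list of (u, v) pairs; `phi` is the sampling fraction.
    Assumes the whole edge list is available in memory.
    """
    rng = random.Random(seed)
    nodes = sorted({x for e in edges for x in e})
    # Each node is kept independently with probability phi.
    kept = {v for v in nodes if rng.random() < phi}
    # Induced subgraph: keep every edge whose two endpoints were sampled.
    induced = [(u, v) for (u, v) in edges if u in kept and v in kept]
    return kept, induced
```

Note that the induction step scans all edges after the node set is fixed, which is why this form needs either random access or a second pass.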

Edge sampling (ES). Edge sampling focuses on the selection of
edges rather than nodes to populate the sample. Thus, the node
selection step in an edge sampling algorithm proceeds by just sampling
edges, and including both nodes when a particular edge is sampled.
The partially induced graph is created just out of the sampled
edges, which means no extra edges are added in addition to those
chosen during the random edge selection process. Unfortunately,
ES fails to preserve many desired graph properties due to the
independent sampling of edges. It is, however, more likely to capture
path lengths, due to its bias towards high degree nodes and the
inclusion of both end points of selected edges.
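The bias of edge sampling towards high degree nodes can be made concrete with a small sketch. It assumes each edge is kept independently with a probability `p` (a parameter introduced here for illustration); under that assumption a node of degree d enters the sample with probability 1 - (1 - p)^d. The function names are hypothetical.

```python
import random


def edge_sample(edges, p, seed=0):
    """Independent edge sampling; the sample is only partially induced."""
    rng = random.Random(seed)
    # Keep each edge independently with probability p (an assumption).
    sampled_edges = [e for e in edges if rng.random() < p]
    # Both endpoints of every sampled edge join the node sample.
    sampled_nodes = {x for e in sampled_edges for x in e}
    return sampled_nodes, sampled_edges


def inclusion_probability(degree, p):
    """Probability that a node with `degree` incident edges is sampled."""
    return 1.0 - (1.0 - p) ** degree
```

For example, with p = 0.3 a degree-1 node is included with probability 0.3, while a degree-20 node is almost always included.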



Topology-based sampling. Due to the known limitations of NS
([37], [24]) and ES (bias toward high degree nodes), researchers
have also considered many other topology-based sampling methods.
One example is snowball sampling, which selects nodes using
breadth-first search from a randomly selected seed node. Snowball
sampling accurately maintains the network connectivity within the
snowball; however, it suffers from a boundary bias in that many
peripheral nodes (i.e., those sampled on the last round) will be
missing a large number of neighbors [24]. In [26], Leskovec et al.
propose a Forest Fire Sampling (FFS) method. It starts by picking a
node uniformly at random, then 'burns' a random fraction of its
outgoing links. The process is recursively repeated until no new node
is selected or we obtain the sample size. In general, such
topology-based sampling approaches perform better than NS and ES.
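To show why topology-based methods rely on random access, the following is a minimal sketch of snowball sampling over an in-memory adjacency map. The name `snowball_sample` and the dictionary-of-lists graph representation are assumptions made for this example; the seed node is passed in by the caller rather than drawn at random.

```python
from collections import deque


def snowball_sample(adj, seed_node, n):
    """Breadth-first (snowball) sampling of up to n nodes.

    `adj` maps each node to a list of its neighbours.
    """
    sampled = {seed_node}
    queue = deque([seed_node])
    while queue and len(sampled) < n:
        u = queue.popleft()
        # This lookup needs the full neighbourhood of u at once, which a
        # single pass over an edge stream cannot provide.
        for w in adj[u]:
            if w not in sampled:
                sampled.add(w)
                queue.append(w)
                if len(sampled) == n:
                    break
    # Keep the edges among the sampled nodes.
    edges = {frozenset((u, w)) for u in sampled for w in adj[u] if w in sampled}
    return sampled, edges
```

Nodes added in the last round keep only the edges that point back into the snowball, which is the boundary bias described above.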

None of the algorithms discussed above have been explicitly
designed to work in a streaming fashion, as the emphasis has been
largely on sampling representative subgraphs that match the
properties of the original graph well. In the next section, we discuss
our model of graph streams, and show how these standard sampling
algorithms could be adapted to work in such a streaming setting. We
also propose our new algorithm, which outperforms simple streaming
variants of these algorithms.

3. STREAM SAMPLING 

We consider an undirected graph $G(V, E)$ with a vertex set
$V = \{v_1, v_2, \ldots, v_N\}$ and edge set
$E = \{e_1, e_2, \ldots, e_M\}$, where $N$ is the
number of vertices and $M$ is the number of edges in $G$. We assume
$G$ arrives as a graph stream.

Definition 3.1. We define a graph stream as a sequence of edges
$e_{\pi(1)}, e_{\pi(2)}, \ldots, e_{\pi(M)}$, where $\pi$ is any random
permutation on $[M] = \{1, 2, \ldots, M\}$, $\pi : [M] \to [M]$.
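Definition 3.1 can be simulated from a stored edge list by drawing a random permutation and yielding edges one at a time. The generator below is a minimal sketch; the name `graph_stream` and the `seed` argument are illustrative assumptions.

```python
import random


def graph_stream(edges, seed=None):
    """Yield the edges of a stored graph in a uniformly random order."""
    order = list(range(len(edges)))
    # `order` plays the role of the permutation pi on [M].
    random.Random(seed).shuffle(order)
    for idx in order:
        yield edges[idx]
```

A consumer of this generator sees each edge exactly once and cannot look ahead, which mirrors the single-pass restriction of the streaming model.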

Traditional computational models of graphs assume random access
to the entire graph $G$ at any step, which is difficult to provide
since large graphs are unlikely to fit in main memory. A streaming
model, in which the graph can only be accessed as a stream of
edges arriving one edge at a time, is therefore preferable [4].

In a streaming model, as each edge $e \in E$ arrives, the sampling
algorithm $\sigma$ needs to decide whether to include the edge or not
as the edge is streamed in. The sampling algorithm $\sigma$ may also
maintain state $\Psi$, and consult the state to determine whether to
sample a subsequent edge or not, but the total storage associated with
$\Psi$ should be of the order of the size of the output sampled graph
$G_s$, i.e., $|\Psi| = O(|G_s|)$. Note that this requirement is
potentially larger than the $o(N, t)$ (preferably, $polylog(N, t)$)
that streaming algorithms typically require [32]. But, since no
algorithm can require less space than its output, we relax this
requirement in our definition as follows.

Definition 3.2. We define a streaming graph sampling algorithm
as any sampling algorithm $\sigma$ that produces a sampled graph $G_s$
such that $|V_s|/|V| = \phi$, which (1) samples edges of the original
graph $G(V, E)$ in a sequential order (i.e., not random access) in
one pass; and, (2) maintains state $\Psi$ that is of the order of the
size of the sampled graph $G_s$, i.e., $|\Psi| = O(|G_s|)$.

Using the above definition of a streaming graph sampling
algorithm, we now present streaming variants of the different
algorithms discussed in Section 2.

3.1 Streaming Node Sampling 

One key problem with traditional node sampling we discussed in
Section 2 is that nodes are selected at random. In our stream setting,
new nodes arrive into the system only when an edge that contains
the new node is added into the system; it is therefore hard to
identify which $n$ nodes to select a priori. To address this, we
essentially use the idea of reservoir sampling [40] and propose the
following streaming node sampling variant (outlined in Algorithm 1).

The main idea is to select nodes uniformly at random with the
help of a uniform random hash function. Specifically, we keep
track of the nodes with the $n$ smallest hash values in the graph;
nodes are only added if their hash values are among the top-$n$ minimum
hashes among all nodes seen thus far in the stream. Any edge that has
both vertices already in the reservoir is automatically added to the
sampled graph. Since the reservoir is finite, it can happen that a node
that arrives much later may have a smaller hash value, in which
case it replaces an existing node. All edges incident on the replaced
node are then removed from the sampled graph, as there is no chance for
those edges to ever get sampled again. Thus, once the reservoir is
filled up to $n$ nodes, it will remain at $n$ nodes, but at the same
time, it will guarantee sampling from all portions of the stream (not
just the front) since the selection is based on the hash value.



Algorithm 1 Streaming NS(Sample Size n, Stream S)

    $V_s = \emptyset$, $E_s = \emptyset$
    $h$ is a fixed uniform random hash function
    $t = 1$
    for $e_t$ in the graph stream $S$ do
        $(u, v) = e_t$
        if $u \notin V_s$ and $h(u)$ is a top-$n$ min hash then
            $V_s = V_s \cup \{u\}$
            Remove all edges incident on the replaced node
        end if
        if $v \notin V_s$ and $h(v)$ is a top-$n$ min hash then
            $V_s = V_s \cup \{v\}$
            Remove all edges incident on the replaced node
        end if
        if $u, v \in V_s$ then
            $E_s = E_s \cup \{e_t\}$
        end if
        $t = t + 1$
    end for
    Output $G_s = (V_s, E_s)$
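Algorithm 1 can be sketched in Python as follows. This is one possible reading, not the authors' implementation: the hash function (BLAKE2b keyed with a salt), the heap-based reservoir, and all names are assumptions chosen for the example.

```python
import hashlib
import heapq


def node_hash(node, salt=b"demo"):
    """Fixed pseudo-uniform 64-bit hash of a node (stand-in for h)."""
    digest = hashlib.blake2b(repr(node).encode(), key=salt, digest_size=8).digest()
    return int.from_bytes(digest, "big")


def streaming_ns(stream, n, salt=b"demo"):
    """Keep the n nodes with the smallest hashes plus the edges among them."""
    heap = []   # max-heap via negated hashes: entries are (-hash, node)
    adj = {}    # sampled node -> set of sampled neighbours
    for u, v in stream:
        for x in (u, v):
            if x in adj:
                continue
            hx = node_hash(x, salt)
            if len(heap) < n:
                heapq.heappush(heap, (-hx, x))
                adj[x] = set()
            elif hx < -heap[0][0]:
                # x beats the largest hash in the reservoir: evict that node
                # and remove all edges incident on it.
                _, evicted = heapq.heapreplace(heap, (-hx, x))
                for nb in adj.pop(evicted):
                    adj[nb].discard(evicted)
                adj[x] = set()
        # Add the edge only if both endpoints are currently in the reservoir.
        if u != v and u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)
    edges = {frozenset((a, b)) for a in adj for b in adj[a]}
    return set(adj), edges
```

Because an evicted or rejected node has a hash above the current reservoir maximum, and that maximum only decreases, such a node can never re-enter the sample. Edges that arrived before both endpoints were admitted are lost, so the result is a partial rather than total induction.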



3.2 Streaming Edge Sampling 

Streaming edge sampling is a simple variant of traditional
edge sampling. Here, instead of hashing individual nodes, we focus
on hash-based selection of edges (as shown in Algorithm 2).
More precisely, if we are interested in obtaining $m$ edges at random
from the stream, we can simply keep a reservoir of the $m$ edges
with the minimum hash values. Thus, when a new edge streams into
the system, we check whether its hash value is within the top-$m$
minimum hash values. If it is not, then we do not select that edge;
otherwise we add it to the reservoir, replacing the edge that
previously held the largest of the top-$m$ minimum hash values. A
similar approach has been proposed by Aggarwal in [2]. However, in his
work the goal was to get efficient structural compression of the
underlying graph stream rather than a representative subgraph that can
be used instead of the full graph. One problem with this approach
is that our goal is often stated in terms of sampling a certain number
of nodes $n$. Since we use a reservoir of edges, finding the right $m$
that provides $n$ nodes is hard, and the resulting node count keeps
varying depending on which edges the algorithm ends up selecting. Note
that the sampling fraction could also be specified in terms of a
fraction of edges; the choice of defining it in terms of nodes is
somewhat arbitrary in that sense. For our comparison purposes, we
chose a large enough $m$ such that the number of nodes was much
higher than $n$, and later iteratively pruned out sampled edges with
the maximum hash values until the target number of nodes $n$ was
reached. While this is not strictly an elegant streaming algorithm,
as we shall show in our evaluation, even this extra complexity does
not produce good graph samples. We include it
mainly for comparison purposes.



Algorithm 2 Streaming ES(Sample Size n, Stream S)

    $V_s = \emptyset$, $E_s = \emptyset$
    $h$ is a fixed uniform random hash function
    $t = 1$
    for $e_t$ in the graph stream $S$ do
        $(u, v) = e_t$
        if $h(e_t)$ is in top-$m$ min hash then
            $E_s = E_s \cup \{e_t\}$
            $V_s = V_s \cup \{u, v\}$
        end if
        Iteratively remove edges in $E_s$ such that $V_s$ has $n$ nodes
        $t = t + 1$
    end for
    Output $G_s = (V_s, E_s)$
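A possible Python rendering of Algorithm 2 is sketched below, with the pruning step applied once after the pass rather than inside the loop. The hash function, the heap-based reservoir, and the names `edge_hash`, `streaming_es` and `prune_to_n_nodes` are assumptions made for illustration.

```python
import hashlib
import heapq
from collections import Counter


def edge_hash(edge, salt=b"demo"):
    """Fixed pseudo-uniform 64-bit hash of an undirected edge."""
    a, b = sorted(edge)  # (u, v) and (v, u) must hash identically
    digest = hashlib.blake2b(repr((a, b)).encode(), key=salt, digest_size=8).digest()
    return int.from_bytes(digest, "big")


def streaming_es(stream, m, salt=b"demo"):
    """Keep the m edges with the smallest hash values seen so far."""
    heap = []          # max-heap via negated hashes: (-hash, edge)
    in_reservoir = set()
    for u, v in stream:
        e = (u, v) if u <= v else (v, u)
        if e in in_reservoir:
            continue   # repeated edge already sampled
        he = edge_hash(e, salt)
        if len(heap) < m:
            heapq.heappush(heap, (-he, e))
            in_reservoir.add(e)
        elif he < -heap[0][0]:
            _, old = heapq.heapreplace(heap, (-he, e))
            in_reservoir.discard(old)
            in_reservoir.add(e)
    edges = [e for _, e in heap]
    return {x for e in edges for x in e}, edges


def prune_to_n_nodes(edges, n, salt=b"demo"):
    """Drop the largest-hash edges until at most n nodes remain."""
    kept = sorted(edges, key=lambda e: edge_hash(e, salt))
    degree = Counter(x for e in kept for x in e)
    while len(degree) > n and kept:
        for x in kept.pop():  # edge with the largest remaining hash
            degree[x] -= 1
            if degree[x] == 0:
                del degree[x]
    return set(degree), kept
```

Removing a single edge can remove two nodes at once, so the pruned sample may end up with slightly fewer than n nodes; this is one reason the node count is hard to control with an edge reservoir.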



3.3 Streaming Topology-Based Sampling 

We also consider a streaming variant of a topology-based sampling
algorithm. Specifically, we consider a simple BFS-based algorithm
(shown in Algorithm 3) that works as follows. This algorithm
essentially implements a simple breadth-first search on a
sliding window of $w$ edges in the stream. In many respects, this
algorithm is similar to the forest-fire sampling (FFS) algorithm. Just
as in FFS, it essentially starts at a random node in the graph and
selects an edge to burn (as in FFS parlance) among all edges incident
on that node within the sliding window. For every edge burned, let
$v$ be the other end of the burned edge. We enqueue $v$ onto a queue
$Q$ in order to get a chance to burn its incident edges within the
window. For every new streaming edge, the sliding window moves one
step, which means the oldest edge in the window is dropped and
a new edge is added. (If that oldest edge was sampled, it will still
be part of the sampled graph.) If, as a result of the sliding window
moving one step, the node has no more edges left to burn, then the
burning process will dequeue a new node from $Q$. If the queue
is empty, the process jumps to a random node within the sliding
window (just as in FFS). This way, it does BFS as much as possible
within a sliding window, with random jumps if there are no
more edges left to explore. Note that there may be other streaming
variants of the sampling algorithm possible; since there are no
streaming algorithms in the literature, we chose this as a reasonable
approximation for comparison. This algorithm has a similar problem
to the edge sampling variant, in that it is difficult to control the
exact number of sampled nodes, and hence some additional pruning
needs to be done at the end (as shown in Algorithm 3).

3.4 Partially-Induced Edge Sampling (PIES) 

We finally present our main algorithm, called PIES, which
outperforms the above classes of streaming algorithms. In our approach,
we mainly exploit the observation that edge sampling is inherently
biased towards the selection of nodes with higher degrees, resulting
in an upward bias in the degree distributions of sampled nodes
compared to nodes in the original graph [34]. However, in all sampled
subgraphs, degrees are naturally underestimated since only a
fraction of neighbors may be selected. This results in a downward


Algorithm 3 Streaming BFS(Sample Size n, Stream S, Window Size = wsize)

    $V_s = \emptyset$, $E_s = \emptyset$
    $W = \emptyset$
    Add the first wsize edges to $W$
    $t$ = wsize
    Create a queue $Q$
    $u$ = random vertex from $W$
    for $e_t$ in the graph stream $S$ do
        if $u \notin V_s$ then add $u$ to $V_s$
        if $W$.incident_edges($u$) $\neq \emptyset$ then
            Sample $e$ from $W$.incident_edges($u$)
            Add $e = (u, v)$ to $E_s$
            Remove $e$ from $W$
            Add $v$ to $V_s$
            enqueue $v$ onto $Q$
        else
            if $Q = \emptyset$ then $u$ = random vertex from $W$
            else $u$ = $Q$.dequeue()
        end if
        Move the window $W$
        if $|V_s| > n$ then
            Retain $[e] \subset E_s$ such that $[e]$ has $n$ nodes
            Output $G_s = (V_s, E_s)$
        end if
        $t = t + 1$
    end for
    Output $G_s = (V_s, E_s)$



bias, regardless of the actual sampling algorithm used. We also
observe that selecting nodes with high degrees results in samples with
higher average clustering coefficient and shorter path lengths. It is
likely that two interconnected sampled nodes will have the same
neighbor if this neighbor is sampled and has an extremely large
degree. Additionally, high-degree nodes are usually highly popular
in the graph; they serve as good navigators through the graph, and
the shortest path usually passes through those extremely popular nodes.
Therefore, sampling the high degree nodes can result in connected
sampled subgraphs that accurately preserve the properties of the
full graph.

However, by sampling edges independently, it is unlikely that
the structure of the graph surrounding the high degree nodes will be
preserved. Thus, we also sample all the edges between any sampled
nodes in the graph (graph induction). This helps to recover much
of the connectivity around the high degree nodes, offsetting the
downward degree bias as well as increasing local clustering in the
sampled graph. Graph induction increases the probability that
triangles will be formed among the set of sampled nodes, resulting in
a higher clustering coefficient and shorter path lengths. The above
observations, while simple, make the sampled graphs approximate
the characteristics of the original graph much more accurately, even
better than topology-based sampling algorithms.

Unfortunately, full graph induction in a streaming fashion is hard
(since it requires at least two passes when done in the obvious
straightforward way). Thus, instead of total induction of the edges
between the sampled nodes, we can utilize partial induction and
combine the edge-based node sampling with the graph induction
(as shown in Algorithm 4) into a single step. The partial induction
step induces the sample in the forward direction, i.e., adding any
edge among a pair of sampled nodes if it occurs after both
nodes were added to the sample.

PIES aims to maintain a dynamic sample as the graph is streaming,
utilizing the same reservoir sampling idea we have used before.
Specifically, we add the first $n$ records of the stream to a reservoir
and then the rest of the stream is processed randomly by replacing
existing records in the reservoir. PIES simply runs over the
edges in a single pass, deterministically adding the first $n$ nodes of
the stream to the sampled graph. Once it achieves the target sample
size, then for any streaming edge, it adds the incident nodes
to the sample (probabilistically) by replacing other sampled nodes
from the node sample set (uniformly at random). At each step, it
also adds the edge if its two incident nodes are already in the
sampled node set (to produce a partial induction effect).

Algorithm 4 PIES(Sample Size n, Stream S)

    $V_s = \emptyset$, $E_s = \emptyset$
    $t = 1$
    while graph is streaming do
        $(u, v) = e_t$
        if $|V_s| < n$ then
            if $u \notin V_s$ then $V_s = V_s \cup \{u\}$
            if $v \notin V_s$ then $V_s = V_s \cup \{v\}$
            $E_s = E_s \cup \{e_t\}$
        else
            draw $r$ from continuous Uniform(0,1)
            if $r < p_e$ then
                draw $i$ and $j$ from discrete Uniform[1, $|V_s|$]
                if $u \notin V_s$ then $V_s = V_s \cup \{u\}$, drop node $V_s[i]$
                    with all its incident edges
                if $v \notin V_s$ then $V_s = V_s \cup \{v\}$, drop node $V_s[j]$
                    with all its incident edges
            end if
            if $u \in V_s$ AND $v \in V_s$ then $E_s = E_s \cup \{e_t\}$
        end if
        $t = t + 1$
    end while
    Output $G_s = (V_s, E_s)$
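The following Python sketch is one way to read Algorithm 4. It is not the authors' code: the replacement probability `p_e` is not defined in this section, so it is treated here as a caller-supplied constant, and the list-plus-adjacency-map data layout and the `seed` argument are assumptions.

```python
import random


def pies(stream, n, p_e, seed=0):
    """Partially-induced edge sampling over an iterable of (u, v) edges."""
    rng = random.Random(seed)
    slots = []   # V_s as a list, so a uniformly random slot can be replaced
    adj = {}     # sampled node -> set of sampled neighbours (encodes E_s)

    def replace(slot, new_node):
        # Drop the node in this slot together with all its incident edges.
        old = slots[slot]
        for nb in adj.pop(old):
            adj[nb].discard(old)
        slots[slot] = new_node
        adj[new_node] = set()

    for u, v in stream:
        if len(slots) < n:
            # Fill phase: admit both endpoints deterministically. As in the
            # pseudocode, this can leave n + 1 nodes in the reservoir.
            for x in (u, v):
                if x not in adj:
                    slots.append(x)
                    adj[x] = set()
        elif rng.random() < p_e:
            # Replacement phase: new endpoints overwrite random slots.
            i = rng.randrange(len(slots))
            j = rng.randrange(len(slots))
            if u not in adj:
                replace(i, u)
            if v not in adj:
                replace(j, v)
        # Partial induction: keep the edge only if both endpoints are
        # in the sample at this point (self-loops are ignored).
        if u != v and u in adj and v in adj:
            adj[u].add(v)
            adj[v].add(u)

    edges = {frozenset((a, b)) for a in adj for b in adj[a]}
    return set(slots), edges
```

Storing the sampled edges as adjacency sets makes dropping a node with all its incident edges proportional to that node's sampled degree, and keeps the state within the order of the sample size, as Definition 3.2 requires.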

4. EXPERIMENTAL EVALUATION 

In this section, we evaluate the efficacy of the proposed stream
sampling algorithms, PIES, NS, ES, and BFS, on several real datasets
ranging from about 10,000 to 800,000 nodes, with 30,000 to 6.6 million
edges. In our experiments, we consider five real networks: a
citation network, a collaboration network, an email communication
network, and two online social networks. Table 1 summarizes the
characteristics of the (simplified) real networks.

The two data sets called HepPH and CondMAT correspond to
a citation graph and a collaboration graph respectively, provided by
Leskovec et al. [1]. The Facebook data corresponds to Wall
communications among users in the city of New Orleans [39].
The Twitter dataset contains tweets of users in discussion surrounding
the United Nations climate change conference in Dec. 2009.
Also, the University email data corresponds to two weeks of email
communication we collected from the email logs on Purdue University
mailserver(s). We also verify our proposed approach on a large
scale graph of 800,000 nodes and 6.6 million edges collected from
the Flickr network [15].

4.1 Evaluation Measures 

We compare four stream sampling methods from different sampling
classes. We propose a one-pass implementation of node sampling
(NS) and edge sampling (ES) to represent the node-based and
edge-based sampling classes respectively. We also propose a
one-pass breadth first sampling (BFS) to represent the topology-based
sampling class. We implement BFS on a sliding window
of 100 edges of the stream. Our evaluation is primarily along
four main properties: degree, path length, clustering coefficient,
and size of weakly connected components. We conjecture these
four properties capture both local and global characteristics of the
graph. We measure the performance of a sampling algorithm by
how well the sampled subgraphs preserve the probability density
function (PDF) and complementary cumulative distribution function
(CCDF) of each of these four properties. Unlike other measures
based on aggregate statistics (e.g., average degree, density,
reciprocity), these four measures represent the distribution of
properties across the nodes and edges in the sample, which facilitates
detailed comparison and evaluation of sample representativeness.

Dataset         Nodes     Edges      No. CC  Avg. path  Density                 Clustering
HepPH           34,546    420,877    61      4.33       $7 \times 10^{-4}$      0.146
Twitter         8,581     27,889     162     4.17       $7 \times 10^{-4}$      0.061
Facebook (NO)   46,952    183,412    842     5.6        $2 \times 10^{-4}$      0.085
CondMAT         23,133    93,439     567     5.35       $4 \times 10^{-4}$      0.264
Email-PU Univ   214,893   1,270,285  24      3.91       $5.5 \times 10^{-5}$    0.0018
Flickr          820,878   6,625,280  1       5.01       $1.9 \times 10^{-5}$    0.116

Table 1: Characteristics of Network Datasets
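As a quick sanity check on the density column of Table 1, the node and edge counts can be plugged into the usual density formula for a simple undirected graph, 2M / (N(N - 1)). The paper does not state which formula it uses, so this is an assumption; the counts below are the ones reported in Table 1.

```python
def density(num_nodes, num_edges):
    """Density of a simple undirected graph: 2M / (N (N - 1))."""
    return 2 * num_edges / (num_nodes * (num_nodes - 1))


# (nodes, edges) as reported in Table 1.
table1 = {
    "HepPH": (34546, 420877),
    "Twitter": (8581, 27889),
    "Facebook (NO)": (46952, 183412),
    "CondMAT": (23133, 93439),
    "Email-PU Univ": (214893, 1270285),
    "Flickr": (820878, 6625280),
}

for name, (n, m) in table1.items():
    print(f"{name}: {density(n, m):.1e}")
```

Under this assumption the computed values agree with the reported densities up to rounding.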

In addition to visually comparing the similarity of the distributions
on the sampled subgraphs to those of the original graphs,
we also compute two statistics to compare the distributions
quantitatively across different sampling fractions. First, we use the
Kolmogorov-Smirnov (KS) statistic to assess the distance between
two cumulative distribution functions (CDF). The KS-statistic is a
widely used measure of the agreement between two distributions;
the authors of [26] also have used the KS distance to illustrate the
accuracy of FFS samples in the past. It is computed as the maximum
absolute distance between the two distributions, where $x$ represents
the range of the random variable and $F_1$ and $F_2$ represent
two CDFs. In this work, $F_1$ represents the true distribution of the
full graph and $F_2$ represents the approximation of $F_1$ calculated
from the sampled subgraph.

$$KS(F_1, F_2) = \max_x |F_1(x) - F_2(x)| \qquad (1)$$

We also used another statistical measure for evaluation called skew
divergence, in order to measure the Kullback-Leibler (KL) divergence
between two distributions that do not have the same continuous
support over the range of values [23]. The results of skew
divergence are similar to the KS statistic results; therefore we
omitted them to save space.
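Equation (1) can be evaluated directly on two lists of observed values, for example the node degrees of the full graph and of a sample. The sketch below builds both empirical CDFs over the union of observed values; the function name `ks_distance` is an assumption for this example.

```python
from collections import Counter


def ks_distance(values_full, values_sample):
    """Maximum absolute gap between two empirical CDFs, as in Equation (1)."""
    n_full, n_sample = len(values_full), len(values_sample)
    count_full, count_sample = Counter(values_full), Counter(values_sample)
    cdf_full = cdf_sample = 0.0
    largest_gap = 0.0
    # Walk the union of observed values in increasing order.
    for x in sorted(set(count_full) | set(count_sample)):
        cdf_full += count_full[x] / n_full
        cdf_sample += count_sample[x] / n_sample
        largest_gap = max(largest_gap, abs(cdf_full - cdf_sample))
    return largest_gap
```

For instance, degree lists [1, 1, 2, 3] and [1, 2, 2, 3] differ most at x = 1, where the CDFs are 0.5 and 0.25, giving a KS distance of 0.25.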

4.2 Results 

In our experiments, we focus on obtaining a sample between 5% and
40% ($\phi = 0.05$ to $0.40$) of the full graph. For each sample
fraction, we experiment with ten different runs, and in each run, we
generate a sample from a new random seed. It is unrealistic to assume
a certain order of edges in the stream, because social communication
among users can usually happen in any arbitrary order. To simulate this
aspect, we randomly permute the edges in the graph in each run.

KS-statistic. We compute the average of each of these measures
across the five datasets and ten runs for each metric. Figures 1(a)
to 1(d) show the average KS-statistic for degree, path length,
clustering coefficient and size of connected components, respectively.
We observe that PIES outperforms BFS, NS, and ES for degree, path
length, and clustering coefficient. NS ranks second
after PIES for the aforementioned measures. Both BFS and ES
outperform PIES and NS on the size of connected components, but
they do not perform well on the other measures. Overall, the sampling
algorithms that include an induced graph step (PIES and NS)
in their process perform well for the cases of degree, path length
and clustering coefficient, as they capture more edges between the
sampled nodes.

Distributions. We plot the distributions of the three metrics in
Figure 2 for Facebook (a-c) and the Purdue University email network
(d-f) at the 20% sampling fraction. We picked the 20% sampling fraction
as a reasonable sample size to show the difference between the
distributions of different sampling algorithms. However, other sampling
proportions show similar relative behavior among the algorithms.

Figures 2(a) and 2(d) show the degree distribution for the two
networks. From the figures, we can observe that NS under-estimates
the degree of the nodes, resulting in a large fraction of zero-degree
(low-degree) nodes in its sample across the two networks. Similarly,
BFS and ES also capture a large fraction of low-degree nodes.

Figures 2(b) and 2(e) show the path length distribution for the
two networks. We observe that NS samples have a high fraction of long
path lengths compared to PIES, since NS samples low-degree nodes
more than high-degree nodes.

Figures 2(c) and 2(f) show the clustering coefficient distributions.
Across the two networks, NS, ES and BFS produce unclustered
samples. PIES performs well for both the Email and Facebook
networks; however, it performs similarly to NS on the HepPH
network.

Overall, PIES is the closest to preserving the three distributions
compared to the other methods. This is due to the fact that PIES
samples high degree nodes with a larger probability than NS. Similar to
PIES, both BFS and ES select high degree nodes with a probability
higher than NS. However, PIES outperforms BFS and ES since it
adds extra edges between the sampled nodes (i.e. through partial
induction in the forward direction).

We omitted the plots for the size of weakly connected components
due to limited space; however, ES and PIES outperformed
the other methods.

In addition to analyzing the KS statistic as an average over all
networks, we also analyze the performance of PIES for each network
in Figure 3 (average over all graph properties), sorting the
networks in increasing order from left to right in terms of their
density and clustering. The results indicate that PIES performs better
on datasets that are less dense/clustered. This is an interesting
result, which suggests that PIES will be more suitable for sampling
rapidly changing graph streams that are more likely to have a lower
density over time.

Evaluation on different points of the stream. Further, Figures 4(a)
and 4(b) show the KS statistics (average over all graph properties)
of the different algorithms at different points in the stream while
it is progressing. PIES performs better than NS, BFS, and ES on
Facebook. However, PIES performs only slightly better than the other
methods on HepPh. This also illustrates that PIES can maintain a
consistently good random sample at different lengths of the stream.

Figure 1: Average KS Distance across 5 datasets.

Figure 3: Average KS Statistics for different networks (sorted

Figure 4: Average KS Statistics at different points of the stream

Back-in Time Goal. Leskovec et al. proposed the back-in time 
sampling goal [26], which corresponds to traveling back in time and 
capturing properties of past versions of G at sizes n' < n. In 
this experiment, we investigate whether we can sample in a manner 
that allows us to match what the stream looked like in the past. 
This can help in studying the stationarity properties of the graph 
stream as it evolves over time. Figure 5 shows the average KS 
statistics (average over all graph properties) of the different 
algorithms when the goal is to approximate the graph stream back-in 
time, when it was only 20% the size of the full stream. We again 
observe that PIES performs better than the other algorithms. We 
show the results only for the Facebook and HepPh networks; however, 
the same conclusions apply for the other datasets. 

Figure 5: Average KS Statistics when the goal is to match the 
graph back-in time at 20% of the stream. 

Sampling from Very Large graphs. While sampling from small 
graphs (i.e. with thousands of nodes/edges) is important for many 
applications, it is unrealistic for many other applications that 
deal with very large graphs with hundreds of thousands of 
nodes/edges. These large graphs are typically too big to fit into 
memory and are therefore hard to process with existing sampling 
methods. We therefore also evaluated our proposed algorithm PIES 
on a large-scale graph with 800,000 nodes and 6.6 million edges 
collected from the Flickr network. As shown in Figure 6, the 
properties of the PIES sampled graphs are close to those of the 
larger Flickr network, using only a single pass over the edges. 

Comparison with non-streaming algorithms. Our goal is to obtain 
a representative sample from a stream that is either evolving over 
time or too large to fit into memory. Here we compare PIES to 
non-streaming sampling algorithms, namely Forest Fire Sampling 
(FFS) [26] and fully-induced edge sampling (ES-i). In the case of 
FFS, we use p_f = 0.7 as in [26]. In ES-i, we first sample the 
edges with ES, then we add all the edges among the sampled nodes 
in a second pass (full induction). 
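
The two passes of ES-i described above can be sketched as follows. This is an illustrative reading of the description, not the exact implementation used in the experiments. The function name, the per-edge sampling probability p, and the rng argument are assumptions, and the edge list is assumed to be available for a second pass, which is what makes the method non-streaming.

```python
import random


def es_i(edges, p, rng):
    """Edge sampling followed by full induction (illustrative sketch).

    Pass 1: sample each edge independently with probability p and
    collect the endpoints of the sampled edges.
    Pass 2: keep every edge whose two endpoints were both collected.
    """
    sampled_nodes = set()
    for u, v in edges:
        if rng.random() < p:
            sampled_nodes.update((u, v))
    # Full induction needs a second look at the complete edge list.
    induced_edges = [
        (u, v) for u, v in edges
        if u in sampled_nodes and v in sampled_nodes
    ]
    return sampled_nodes, induced_edges


# Hypothetical toy edge list.
edge_list = [(1, 3), (1, 2), (2, 3), (3, 4), (4, 5)]
nodes, induced = es_i(edge_list, p=0.5, rng=random.Random(0))
print(nodes, induced)
```

The second pass recovers edges that appear in the list before both of their endpoints were sampled, which a single forward pass over a stream cannot do.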

Figure 7 shows the average KS statistic (average over all graph 
properties) for the five networks. Overall, PIES performs better 
than both ES-i and FFS. However, ES-i performs better for HepPh 
and CondMAT. This illustrates the effect of full induction versus 
partial induction for denser networks. Since ES-i gets the chance 
to add more edges among the sampled nodes, it outperforms PIES on 
graphs with higher density/clustering. PIES, in turn, performs 
better on graphs that are less dense and less clustered. 

Note that for the Flickr data experiments, we compare PIES to the 
true distribution only, since the other baseline methods are 
inefficient to run for very large graphs and they don't match the 
graph properties well on smaller sampling sizes. 

Figure 7: Average KS Statistics for different networks (sorted 
in increasing order of clustering/density from left to right). 

5. RELATED WORK 

Sampling from Graphs. The problem of sampling graphs has 
been of interest in many different fields of research. The work in 
[24, 42, 37] studies the statistical properties of samples from 
complex networks produced by traditional sampling algorithms such 
as node sampling, edge sampling, and random-walk-based sampling, 
and discusses the biases in estimates of graph metrics due to 
sampling. The work in [29] also discusses the connections between 
specific biases and various measures of structural 
representativeness. In addition, there have been a number of 
sampling algorithms in other communities, such as peer-to-peer 
networks [38, 14]. The Internet modeling research community [20] 
and the WWW information retrieval community have focused on 
random-walk-based sampling algorithms such as PageRank [33, 18]. 
There is also some work that highlights the different aspects of 
the sampling problem. 
Examples include f[9l|8"l|5) 

In social networks research, the recent work in [34] uses random 
walks to estimate node properties in G (e.g., degree distributions 
in online social networks). These sampling algorithms focus on 
estimating either the local or global properties of the original 
graph, not on sampling a representative subgraph of the original 
graph, which is our goal. The work in [28] studied the problem of 
sampling a subgraph representative of the graph community 
structure by sampling the nodes that maximize the expansion. 

Due to the popularity of online social networks such as Facebook 
and Twitter, there has been a lot of work f3T| [TT] [25] [2T| [7] p^l 
studying the growth and evolution of these networks. While most 
of this work has been on static graphs, recent works [41, 39] have 
started focusing on interactions in social networks. There is also 
work on decentralized search and crawling [10, 13, 22]; however, 
in our work we focus on sampling from graphs that are naturally 
evolving as a stream of edges. In the literature, the most closely 
related efforts are those of Leskovec et al. in [26] and Hubler et 
al. in [17]. As we mentioned before, our work differs in that we 
focus on the novel problem of sampling from graphs that are 
naturally evolving as a stream of edges (graph streams). 



Impact of Sampling on Other applications. Recently, some research 
has also focused on how the different sampling methods impact the 
performance of applications overlaid on the networks. One such 
study investigated the impact of sampling designs on the discovery 
of the information diffusion process [12]. Another study 
investigated the impact of the choice of the sampling design on the 
performance of relational classification algorithms [6]. 

Data and Graph Streams. Although significant work has addressed 
the problem of graph sampling, to our knowledge there is no prior 
research on sampling from graph streams to obtain a representative 
subgraph. However, several works [36, 9, 16] studied graph 
streaming algorithms for counting triangles, computing degree 
sequences, and estimating page ranks. The main contribution of 
these works is to use a small amount of memory (sublinear space) 
and few passes to perform computations on large graph streams. In 
database research, some work has studied data stream management 
systems. For example, the work in [30] studied the problem of 
computing frequency counts in data streams, and the work in [3] 
studied the problem of sampling from a data stream of database 
queries. 

6. CONCLUSIONS 

Much of the past effort on sampling networks has assumed that the 
sampling algorithm can access the full graph in order to decide 
which nodes/edges to select. However, many large-scale network 
datasets are constructed from a graph stream consisting of 
micro-communications among users (e.g. wall posts, tweets, 
emails). In this work, we have formulated the problem of sampling 
representative subgraphs from such large graph streams. We 
proposed a novel sampling algorithm, PIES, that is based on 
combining edge sampling with partial induction. Our approach is 
not only simple and efficient, it is also amenable to a streaming 
implementation. Furthermore, our empirical results show that PIES 
significantly outperforms other sampling algorithms, both streaming 
and non-streaming, across a range of real-world network datasets. 
In future work, we aim to study the theoretical properties of graph 
stream sampling, in particular those of our proposed method. 